
feat: add EvaluationClient with run() for on-demand session evaluation#300

Merged
aidandaly24 merged 3 commits into main from feat/evaluation_client
Mar 9, 2026

Conversation

jariy17 (Contributor) commented Mar 6, 2026

Summary

  • Add EvaluationClient with run() method that collects spans from CloudWatch and calls the evaluate API with level-aware batching (SESSION/TRACE/TOOL_CALL)
  • Add internal _agent_span_collector package with CloudWatchAgentSpanCollector for span collection with retry/polling
  • Add optional query_string and end_time parameters to CloudWatchSpanHelper to support collector delegation

Details

  • run() accepts evaluator_ids, session_id, and agent_id or log_group_name
  • Auto-derives log group as /aws/bedrock-agentcore/runtimes/{agent_id}-DEFAULT
  • CloudWatch query filters by attributes.session.id + ispresent(scope.name)
  • Auto-batches evaluate requests (max 10 target IDs per request)
  • Caches evaluator level lookups via control plane
  • Operational logging at INFO/DEBUG levels for debugging
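To make the auto-derivation and batching behavior concrete, here is a minimal sketch. The helper names (`derive_log_group`, `chunk`) are illustrative assumptions, not the actual `EvaluationClient` internals:

```python
# Hypothetical sketch of the log-group derivation and request batching
# described above; these helpers are illustrative, not the real
# EvaluationClient implementation.

MAX_TARGETS_PER_REQUEST = 10  # evaluate API accepts at most 10 target IDs


def derive_log_group(agent_id: str) -> str:
    """Derive the CloudWatch log group for an agent runtime."""
    return f"/aws/bedrock-agentcore/runtimes/{agent_id}-DEFAULT"


def chunk(target_ids: list[str], size: int = MAX_TARGETS_PER_REQUEST) -> list[list[str]]:
    """Split target IDs into evaluate-request batches of at most `size`."""
    return [target_ids[i : i + size] for i in range(0, len(target_ids), size)]
```

With 20 trace IDs, `chunk` yields two batches of 10, which matches the two batched requests seen in the manual integration test later in this thread.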

Test plan

  • Unit tests: python -m pytest tests/bedrock_agentcore/evaluation/test_client.py -v (35 tests)
  • Full evaluation suite: python -m pytest tests/bedrock_agentcore/evaluation/ -v (111 tests)
  • Manual integration test with real agent (see PR comment for test script)

jariy17 requested a review from a team on March 6, 2026 at 22:02
jariy17 (Contributor, Author) commented Mar 6, 2026

Manual Integration Test Script

Save as test_client_real.py at repo root and run with python test_client_real.py. Requires AWS credentials with access to the HealthcareAgent runtime and CloudWatch.

This test invokes 20 turns to trigger batching (>10 trace IDs), waits 180s for CW ingestion, then runs EvaluationClient.run().

"""Temporary real test for EvaluationClient.run() batching — delete after testing."""

import json
import logging
import time
import uuid

import boto3

from bedrock_agentcore.evaluation import EvaluationClient

logging.basicConfig(level=logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.WARNING)
logging.getLogger("boto3").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)

AGENT_ARN = "arn:aws:bedrock-agentcore:us-west-2:363376058968:runtime/HealthcareAgent_HealthCareAgent-Pv2decFQqQ"
AGENT_ID = "HealthcareAgent_HealthCareAgent-Pv2decFQqQ"
REGION = "us-west-2"


def invoke_agent(session_id: str, prompt: str) -> str:
    dp_client = boto3.client("bedrock-agentcore", region_name=REGION)
    payload = json.dumps({"prompt": prompt}).encode()
    response = dp_client.invoke_agent_runtime(
        agentRuntimeArn=AGENT_ARN, runtimeSessionId=session_id, payload=payload,
    )
    raw_output = response["response"].read().decode("utf-8")
    text_parts = []
    for line in raw_output.splitlines():
        if line.startswith("data: "):
            chunk = line[len("data: "):]
            if chunk.startswith('"') and chunk.endswith('"'):
                chunk = json.loads(chunk)
            text_parts.append(chunk)
    return "".join(text_parts) if text_parts else raw_output


TURNS = [
    "What are the symptoms of the flu?",
    "How is the flu treated?",
    "When should I see a doctor for the flu?",
    "What causes high blood pressure?",
    "What are the symptoms of diabetes?",
    "How is type 2 diabetes diagnosed?",
    "What are common treatments for asthma?",
    "What causes migraines?",
    "How can I prevent heart disease?",
    "What are the side effects of ibuprofen?",
    "What is the difference between a cold and the flu?",
    "How does pneumonia spread?",
    "What vaccines do adults need?",
    "What are the early signs of arthritis?",
    "How is strep throat diagnosed?",
    "What causes kidney stones?",
    "How can I lower my cholesterol naturally?",
    "What are the symptoms of anemia?",
    "How is a urinary tract infection treated?",
    "What are the warning signs of a stroke?",
]


def main():
    session_id = f"test-batch-{uuid.uuid4()}"
    print(f"Session ID: {session_id}")
    print(f"Turns: {len(TURNS)}")

    for i, prompt in enumerate(TURNS):
        print(f"\n  Turn {i+1}/{len(TURNS)}: {prompt}")
        response = invoke_agent(session_id, prompt)
        print(f"  Response: {response[:150]}...")

    print(f"\n--- Waiting 180s for spans to land in CloudWatch ---")
    time.sleep(180)

    print(f"\n{'='*60}")
    print(f"Running EvaluationClient.run()")
    print(f"{'='*60}")
    client = EvaluationClient(region_name=REGION)
    results = client.run(
        evaluator_ids=["Builtin.Helpfulness"],
        session_id=session_id,
        agent_id=AGENT_ID,
    )

    print(f"\n--- Results ({len(results)} total) ---")
    for r in results:
        print(json.dumps(r, indent=4, default=str))


if __name__ == "__main__":
    main()

Expected output

  • 163 spans collected
  • Evaluator resolved to TRACE level
  • Split into 2 batched requests (20 trace IDs > max 10 per request)
  • 20 evaluation results, each scored ~0.83 ("Very Helpful")

jariy17 force-pushed the feat/evaluation_client branch from e6b25d2 to 5615fb0 on March 6, 2026 at 22:29
jariy17 force-pushed the feat/evaluation_client branch from 5615fb0 to 181b396 on March 9, 2026 at 13:56
aidandaly24 (Contributor) previously approved these changes Mar 9, 2026

Looks good to me, very clean PR. Two small nit comments, but approved.

EvaluationClient collects spans from CloudWatch and calls the evaluate
API with level-aware batching (SESSION/TRACE/TOOL_CALL). Accepts
evaluator_ids, session_id, and agent_id or log_group_name. Auto-derives
log group from agent_id, caches evaluator level lookups, and batches
evaluate requests at max 10 target IDs per request.
jariy17 force-pushed the feat/evaluation_client branch from 5f2c473 to d917f8b on March 9, 2026 at 16:24
for evaluator_id in evaluator_ids:
    level = self._get_evaluator_level(evaluator_id)
    logger.info("Evaluating with %s (level=%s)", evaluator_id, level)
    requests = self._build_requests_for_level(evaluator_id, level, base_input, spans)
aidandaly24 (Contributor) commented:
_build_requests_for_level raises ValueError when spans have no trace/tool IDs, but that exception isn't caught here — only the evaluate() call below is wrapped in try/except. So a TRACE evaluator with no trace IDs crashes the entire run(), while an API error just logs a warning and continues to the next evaluator. Could we wrap this call in the same try/except, or have _build_requests_for_level return [] + log a warning instead of raising?

jariy17 (Contributor, Author) replied:

You're right. I'll just remove the try/except from the for loop, so if anything fails, run() errors out instead of swallowing the error.

Remove try/except around evaluate() so errors propagate to the caller
instead of being silently swallowed. Simplify _extract_trace_ids with
dict.fromkeys(), inline _batch() into list comprehensions, and remove
the evaluator_result_count tracking variable.
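The `dict.fromkeys()` simplification mentioned in the commit above can be sketched like this. The `{"trace_id": ...}` span shape is an assumption for illustration; the real span objects may differ:

```python
# Sketch of order-preserving trace-ID dedupe via dict.fromkeys();
# the span dict shape here is a hypothetical stand-in.

def extract_trace_ids(spans: list[dict]) -> list[str]:
    """Return unique trace IDs in first-seen order, skipping spans without one."""
    return list(dict.fromkeys(s["trace_id"] for s in spans if s.get("trace_id")))
```

`dict.fromkeys()` preserves insertion order (guaranteed since Python 3.7), so it dedupes without losing the order in which spans were collected.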
aidandaly24 (Contributor) commented:

Thanks for making the changes, looks good to me.

aidandaly24 merged commit 102ba0d into main on Mar 9, 2026
20 checks passed
jariy17 deleted the feat/evaluation_client branch on March 9, 2026 at 18:25